Towards a Strategy for a Representation of Collocations - Extending the Danish PAROLE-lexicon

Authors

  • Anna Braasch
  • Sussi Olsen
Abstract

We describe our attempts to formulate a pragmatic definition and a partial typology of the lexical category of 'collocation', taking both lexicographical and computational aspects into consideration. This provides a suitable basis for encoding collocations in an NLP-lexicon. Further, this paper explains the principles of an operational encoding strategy which is applied to a core section of the typology, namely to subtypes of verbal collocation. This strategy is adapted to a pre-defined lexicon model which has been developed in the PAROLE-project. The work is carried out within the framework of the STO-project, the aim of which is to extend the Danish PAROLE-lexicon. The encoding of collocations, in addition to single-word lemmas, greatly increases the lexical and linguistic coverage and thereby also the usability of the lexicon as a whole. Decisions concerning the selection of the most frequent types of collocation to be encoded are based on empirical data, i.e. corpus-based recognition. We present linguistic descriptions with focus on some characteristic syntactic features of collocations that are observed in a newspaper corpus. We then give a few prototypical examples provided with formalised descriptions in order to illustrate the restriction features. Finally, we discuss the perspectives of the work done so far.

1. About the STO project

The aim of the dictionary project STO is to develop a large-scale Danish lexicon for language technology applications, using the Danish PAROLE-lexicon, which consists of 20,000 general language entries, as the point of departure. The establishment of the descriptive model and the linguistic specifications for STO benefits greatly from the experience acquired in the LE-PAROLE work. The lexicon will contain approx. 45,000 general and specialised language entries including semantic information, part of which will be based on reuse of data and specifications from the SIMPLE-project. These will result in approx. 100,000 semantic readings (meanings).

1 SprogTeknologisk Ordbog, literally 'Language Technology Lexicon', i.e. a Danish lexicon for NLP applications. A project initiated by Center for Language Technology in Copenhagen (Braasch et al. 1998).
2 The LE-PAROLE-project (Preparatory Action for linguistic Resources Organisation for Language Engineering), 1996-1998, developed NLP lexicons for 12 European languages provided with morphological and syntactic information.
3 The LE-SIMPLE-project (Semantic Information on Multifunctional Plurilingual Lexica) extends the PAROLE-lexica with semantic information.

2. Lexicographic and computational aspects in combination

A considerable number of lexical units in a text are recurring bound word combinations. With the exception of valency patterns, these have until now not been incorporated into the STO lexicon. In order to extend the lexical and linguistic coverage, one of the most important tasks is to encode such word combinations, including collocations, in the lexicon. To this end we have to set up a classification and develop an encoding strategy that accounts for the specific linguistic properties of collocation types and is compatible with the descriptive model used for single-word lemmas.

In all practical lexicography, one of the most discussed topics is the appropriate selection and description of lexical units that consist of more than a single word. It is well known that they frequently cause problems not only for language learners but also for native speakers, because bound word combinations cannot be understood or produced by using general rules of the language, i.e. they are complex units that cannot be treated fully compositionally (Moon, 1992; Heid, 1998). They can be regarded as coherent and (more or less) lexicalised building blocks of the language and thus they belong to the vocabulary. The lexicalisation of word combinations is a process of step-by-step progression which is influenced by different factors.
The process results in a large number of cohesion types that can be classified along various axes (see e.g. Benson et al. 1986; Alexander 1992). In this connection, lexicographers are concerned with the following basic questions:

  • what kinds of word combinations should be in the dictionary
  • where is their proper position in the macro- and microstructure of the dictionary
  • with which linguistic information should they be described.

In natural language processing (henceforth NLP), non-compositionality is a crucial problem, but one that has so far been less thoroughly worked out. Generally, NLP systems are based on linguistic rules and regular patterns which describe the predictable and systematic behaviour of language; supplementary non-predictable behaviour and arbitrary choices are treated as exceptions to these rules. Linguistic information represented in a lexicon for NLP applications must be very detailed, unambiguous, explicit, exhaustive and formalised. Therefore, for NLP systems, e.g. for machine translation, the lexicographer has to consider some additional questions originating from the specific requirements of computational applications.

In the present lexicon project a further essential aspect must be considered: the description of all lexical unit types must fit into the fixed PAROLE-model, and the linguistic specifications for STO (although they are still modifiable) must be followed. In this sense, morphological and syntactic patterns (including valency frames) that are already encoded must be reused in the encoding of new lexical entries.

3. The PAROLE-model of lexical description

As the point of departure we work with the PAROLE-model in a version that has been slightly modified for Danish. The model was originally developed in the GENELEX-project and was reused in an extended version in PAROLE.
It has a modular architecture comprising three independent, but linked layers of description according to a traditional division of linguistic information into morphological, syntactic and semantic types. The model is generic without a declared commitment to a particular linguistic theory. However, it is heavily inspired by the unification-based theory of Head-Driven Phrase Structure Grammar (Pollard & Sag 1994) which makes use of only very few grammar rules. All important syntactic and semantic processes are driven by information contained in the lexical entries. One of the implications of the modularity is that linguistic behaviours of words are described independently and based purely on features observable at the particular levels in terms of morphological, syntactic and semantic units. A morphological unit contains the exhaustive description of inflection, information on part-of-speech, spelling variants and a few more properties. A syntactic unit contains information about the syntactic structures compatible with the lemma including valency, raising/control. Other syntactic properties of its prototypical syntactic environment can also be described here. The semantic level is not instantiated yet in our lexicon. Morphological and syntactic units are linked to each other according to their connection with the particular lemma. Thus, this model does not operate with a pre-defined lexical unit similar to that in paper dictionaries. However, a ‘dictionary entry’ containing the lemma with all represented morphological and syntactic (and semantic) information can be compiled from the relevant units of the three layers of description. This description method has the advantage of not being static with regard to a presentation of the lexical item together with all related information in a single dictionary entry. In paper dictionaries, information is only linearly accessible beginning from the top of the entry. 
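
The layered organisation described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the class names, field names and the pattern and frame labels are hypothetical and do not reproduce the actual PAROLE/GENELEX schema or the ORACLE database layout. It shows morphological and syntactic units stored independently, linked through the lemma, and a 'dictionary entry' compiled on demand.

```python
from dataclasses import dataclass, field


@dataclass
class MorphUnit:
    lemma: str
    pos: str
    inflection_pattern: str  # hypothetical pattern identifier


@dataclass
class SynUnit:
    lemma: str
    valency_frame: str  # hypothetical frame label


@dataclass
class Lexicon:
    # Each layer is stored separately; units are linked only via the lemma.
    morph: list = field(default_factory=list)
    syn: list = field(default_factory=list)

    def compile_entry(self, lemma):
        """Gather every unit linked to `lemma` into one entry-like view."""
        return {
            "lemma": lemma,
            "morphology": [u for u in self.morph if u.lemma == lemma],
            "syntax": [u for u in self.syn if u.lemma == lemma],
            "semantics": [],  # semantic layer left empty in this sketch
        }


lexicon = Lexicon()
lexicon.morph.append(MorphUnit("tage", "verb", "V-infl-001"))
lexicon.syn.append(SynUnit("tage", "NP-subj + NP-obj"))
lexicon.syn.append(SynUnit("tage", "NP-subj + PP"))

entry = lexicon.compile_entry("tage")
print(entry["lemma"], len(entry["morphology"]), len(entry["syntax"]))
```

Because the entry is assembled at query time, the same stored units could equally be presented grouped by syntactic frame or by part of speech, which is the non-static property the text refers to.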
Decisions regarding the representation of fixed expressions and collocations as lemmas or sublemmas in the structure of the lexicon are therefore not of primary theoretical relevance in our context; cf. the discussion in Moon (1992, esp. pp. 501-502) and Heer Henriksen (1995). On the one hand, by using appropriate facilities of the database in which the lexical data are stored (ORACLE), it is possible to link, fetch and present information from the three layers of the lexicon in several ways. On the other hand, from the practical point of view it is necessary to decide on systematic solutions. In the case of totally invariable word combinations it is appropriate to treat them in the same way as simple lexemes, i.e. as units of the morphological layer. In the same way, a systematic treatment of bound word combinations, i.e. complex lexemes, must be decided on.

4. Criteria for discerning free and bound word combinations

Concordances produced by using the corpus tool XKWIC (see section 5 below) provide us with information about lexical co-occurrences in our corpus. The starting point is to study the findings in the concordances from two points of view. In computational corpus research, the statistical view on the frequency of word co-occurrences (see e.g. Sinclair 1991, p. 109 ff.) is the most prevalent one. The significance of co-occurrences of two or more words within a given 'collocational span' shows the degree of mutual affinity between these words. This quantitative criterion is very important, but used alone it would result in too broad a definition of the term 'collocation', which is inappropriate for practical lexicographical work. When used in combination with linguistic criteria that are more or less commonly agreed on, it provides a firm basis for pragmatic decisions (discussed e.g. in Cowie 1983; Cruse 1986; Benson 1986).
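
The quantitative criterion can be made concrete with a short Python sketch. It counts the words found within a fixed span around a node word and derives a simple pointwise mutual information score. Both the choice of that score and the toy sentence are assumptions made for illustration; the source does not say which significance measure was used, and the function names are hypothetical.

```python
import math
from collections import Counter


def cooccurrences(tokens, node, span=4):
    """Count words within `span` tokens on either side of each `node` hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - span)
        hi = min(len(tokens), i + span + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def pmi(tokens, node, collocate, span=4):
    """Simple PMI variant: log2 of observed over expected co-occurrence."""
    n = len(tokens)
    freq = Counter(tokens)
    joint = cooccurrences(tokens, node, span)[collocate]
    if joint == 0:
        return float("-inf")
    return math.log2(joint * n / (freq[node] * freq[collocate]))


# Invented toy text, not taken from the corpora described in the paper.
tokens = "han vil tage ansvar for sagen og hun vil tage ansvar for resten".split()
counts = cooccurrences(tokens, "tage", span=1)
score = pmi(tokens, "tage", "ansvar", span=1)
print(counts.most_common(2), round(score, 2))
```

A score of this kind ranks candidate pairs, but as the text notes, frequency alone over-generates: the function word "vil" co-occurs with "tage" just as often as "ansvar" does in this toy text, which is why linguistic criteria are applied on top.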
A preliminary definition of a bound word combination is formulated as follows: a frequently co-occurring word combination of two or more components showing a certain degree of structural and meaning cohesion. Frequent co-occurrences of words range from free word combinations over bound word combinations with increasing internal affinity and cohesion to fully frozen units. Figure 1 (below) shows a classification of co-occurrences, deliberately oversimplified for illustration purposes; it is worth noting that there are many overlaps and probably also gaps between the categories mentioned below.

(Free comb.) ... Valency structure ... Collocation ... Multi-word term ... Formula ... Idiom ... Fully frozen expr.
(0 cohesion) ← (increasing affinity between the components ...) → (max. cohesion)

Figure 1: Internal cohesion of co-occurring words seen as a cline

With reference to the terms used in this classification, we deal with the class of collocations. Another terminology is used e.g. in Benson (1986), where collocation is considered a wider term for grammatical collocations (in our classification: valency structures) and lexical collocations (in our classification: collocations). The word combinations extracted from our corpus are very heterogeneous with respect to their internal structure, syntactic function, degree of fixedness, semantic transparency, etc. In our case, the most important properties to be taken into consideration are restrictions on syntactic and lexical variability, which basically differentiate bound word combinations from free combinations. In this respect, it is also important to discern bound word combinations consisting of a verb and a prepositional phrase from valency instances of a verb having particularly strong subcategorisation and selectional restrictions. The examples below illustrate that collocations (1) and (2) look very similar to instances of valency (3) and (4) on the surface:

(1) tage til genmæle 'reply' (lit.: take to reply)
(2) tage [ngt.] i øjesyn 'inspect [smth.]' (lit.: take [smth.] into eye's view)
(3) tage til Berlin / i sommerhuset / på indkøb 'go to Berlin / to the summer house / shopping' (lit.: take to Berlin / in the summer house / on shopping)
(4) tage [ngt.] i skuffen / fra skabet 'take / get [smth.] from the drawer / from the closet'

A valency structure contains a content word (verb, noun or adjective) and a grammatical structure (i.e. prepositional phrase, infinitive, finite or infinite clause) that the content word subcategorises for. Lexical entries in the STO lexicon contain a description of their individual subcategorisation requirements expressed in formalised valency patterns. Navarretta (1997) describes the development of valency descriptions of Danish verbs within the PAROLE-model. This method provides core syntactic information about structural compatibility in a simple and economical way.

Collocations consist of (groups of) content words (nouns, verbs, adjectives and adverbs) where one of the constituents typically carries the meaning and is the syntactically and semantically fixed part (base), while the other one has a weak meaning (collocate) and can be interchanged e.g. with a synonym or an antonym. A rather comprehensive task is to deal with the lexical compatibility of words that occur in collocations, because the choice of the semantically weak constituent is arbitrary and not predictable. In combination with the restrictions on lexical compatibility, collocations often show restricted internal variation of inflection and structure compared to parallel free word combinations.

In the following, we discuss the linguistic features of collocations that we regard as useful criteria for a subclassification and for the selection of frequent collocation types to be dealt with. Basically, collocations are semantically transparent because of the recognised meaning of the base, i.e. the semantic core of the expression.
However, going through our list of collocation candidates we found that the degree of transparency can vary considerably; it is therefore by no means straightforward to use semantic cohesion of co-occurring words as a primary classification criterion. Instead, we concentrate our investigations on the following properties:

  • syntactic label of the whole collocation (i.e. phrase type: VP, NP, ADJP or ADVP)
  • part-of-speech (or syntactic label, if appropriate) of both constituents (base and collocate)

Additionally, it is necessary to check whether the collocation contains a unique component (not existing as an independent lemma outside the collocation, e.g. øjesyn) in order to ensure that all constituent words are encoded in the lexicon as single-word lemmas for reasons of searchability.

5. An outline of the practical work

Our investigation into recurring bound word combinations is based on two Danish corpora. The first and largest one comprises 20 million tokens from newspaper texts; the second one is a corpus of 4 million tokens from newspapers, magazines and books. Neither corpus is part-of-speech-tagged or lemmatised; therefore the processing of corpus evidence involves several manually controlled steps, e.g. the manual partitioning of concordances into subsets based on part-of-speech information. Extension of the available corpora as well as tagging of the corpora is in progress. We use the XKWIC corpus tool (Christ 1993) for the corpus investigations.
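
To illustrate what a keyword-in-context concordance over an untagged, unlemmatised corpus involves, here is a minimal Python sketch. It is not XKWIC and does not imitate its query language; the function name, the hand-listed (and partial) set of inflected forms of tage, and the sample sentence are all assumptions made for the example. Because there is no lemmatisation, the inflected forms have to be enumerated explicitly.

```python
def kwic(tokens, keyword_forms, width=3):
    """Return (left context, node, right context) for every keyword hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() in keyword_forms:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Partial, hand-listed inflected forms; the corpus itself is not lemmatised.
TAGE_FORMS = {"tage", "tager", "tog", "taget"}

# Invented sample sentence, not a corpus citation.
text = "Ministeren tog til genmæle i går og vil tage sagen i øjesyn"
hits = kwic(text.split(), TAGE_FORMS, width=3)

# Sorting on the right context groups recurring continuations together.
for left, node, right in sorted(hits, key=lambda h: h[2]):
    print(f"{left:>25} | {node} | {right}")
```

Sorting the lines on the right or left context is one simple way of obtaining the 'various sorting aspects' under which candidate lists are inspected by hand.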
In order to provide guidelines for the encoding of collocations, we divided the practical work into the following sub-tasks:

  • automatically producing concordances of common nouns and verbs that are already encoded in our lexicon and where the lexicographer noted in a comment that they occur in a great number of recurrent word combinations
  • compiling lists of collocation candidates on the basis of these concordances with various sorting aspects to detect frequent collocation types
  • manually selecting and extracting a few types for detailed analysis
  • comparing the findings with the descriptive model and deciding on an appropriate description strategy
  • setting up initial guidelines for linguistic description of the types selected
  • starting a cycle of testing, refinement and extension

The core task was to select the most frequent types from the list of collocation candidates that are classified in terms of the following properties:

  • syntactic label of the collocation type: verbal phrase
  • the part-of-speech of the base is
      noun/nominal phrase: Vcoll = V + N/NP
      prepositional phrase (PP): Vcoll = V + PP

It is important to note in this connection that in Danish the canonical word order of constituents in these types is collocate – base, which is similar to English but different from German:

(5) tage del i [ngt.] 'take part in [sth]', 'an [etw.+D] Anteil nehmen'

6. Representation of collocations in the PAROLE-model

In the PAROLE-project the encoding of single-word lexical items was in focus, and to our knowledge no attempts have yet been made by other language groups to encode complex lexical items, although the model is also prepared for this task. The advantage of the model is that it is very detailed and explicit and is provided with a comprehensive descriptive language.
6.1 Description with focus on the syntactic level

The following linguistic features are regarded as having primary relevance for the description of collocations:

  • complex structure containing at least one autosemantic (content) word;
  • restricted morphological variability of the components compared with their free occurrences;
  • restricted (morpho-)syntactic variability;
  • a certain degree of meaning cohesion (restricted transparency).

In addition, a syntactico-semantic feature can be made explicit: collocations can function in texts similarly to single-word units. Monolingually, a collocation, e.g. stille krav 'make a demand' (lit.: 'set demand'), can often be substituted by a single-word synonym, kræve 'demand'; collocations are also often translated into a simple target language lexeme. In this paper, we do not discuss purely semantic features like the base-collocate relationships, and their impact on the semantic part of the description is only briefly mentioned.

The features above can combine in several different ways and they are almost inseparably bound to each other, which makes a strictly modular description a cumbersome task. Therefore it is useful to develop a method based on extensive use of patterns in order to describe (morpho-)syntactic features of collocations piece by piece. A pattern is in this sense a generalised description of a particular linguistic behaviour consisting of a unique combination of relevant information pieces which are expressed in terms of feature-value pairs. This is consistent with the method used for description of inflectional behaviours (we have implemented approx. 550 patterns) and for syntactic behaviours (approx. 700 patterns). A pattern in our model may describe one single, several or a large number of instances. (A pattern having just one single instance describes an exceptional behaviour.)
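
The notion of a pattern as a unique combination of feature-value pairs shared by any number of instances can be sketched as follows. This is a hypothetical Python illustration, not the STO encoding format: the class, the pattern identifiers (P1, P2, ...) and the feature names are invented, and only the V + N and V + PP structure labels are taken from the classification in section 5.

```python
from collections import defaultdict


def pattern_key(features):
    """A pattern is identified by its full set of feature-value pairs."""
    return frozenset(features.items())


class PatternInventory:
    def __init__(self):
        self._ids = {}                       # feature combination -> pattern id
        self._instances = defaultdict(list)  # pattern id -> instances

    def assign(self, instance, features):
        """Reuse the pattern with identical features, or create a new one."""
        key = pattern_key(features)
        if key not in self._ids:
            self._ids[key] = f"P{len(self._ids) + 1}"
        pid = self._ids[key]
        self._instances[pid].append(instance)
        return pid

    def exceptional(self):
        """Patterns with exactly one instance (exceptional behaviour)."""
        return [pid for pid, inst in self._instances.items() if len(inst) == 1]


inventory = PatternInventory()
p_ansvar = inventory.assign("tage ansvar", {"phrase_type": "VP", "structure": "V+N"})
p_krav = inventory.assign("stille krav", {"phrase_type": "VP", "structure": "V+N"})
p_genmaele = inventory.assign("tage til genmæle", {"phrase_type": "VP", "structure": "V+PP"})
print(p_ansvar, p_krav, p_genmaele, inventory.exceptional())
```

Keying patterns on the complete feature combination means that a new collocation either reuses an existing pattern or forces a new one, which mirrors how the inflectional and syntactic pattern inventories mentioned above are described as growing.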
6.2 Towards a formalisation of syntactic restriction information

In the following section, we give a number of simplified examples in order to illustrate a pattern construction procedure. The linguistic properties described in these examples are recognised for each of the selected search words in a large number of corpus occurrences. One of the frequent Danish verbs, tage 'to take', has in its various inflected forms roughly 29,000 instances, of which the most frequent eight collocations make a total of approximately 8,000 occurrences, including the collocation tage ansvar 'take/shoulder the responsibility' with 3,128 occurrences. However, we are aware of the fact that such findings have rather limited value because of the size and the composition of the corpus (mainly newspaper texts). Below, we focus on a few restriction types that affect subtypes of verbal collocations (Vcoll) in different ways.


Publication date: 2000